{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4ff7339a",
   "metadata": {},
   "source": [
     "# Notebook: Chatting with NVIDIA Financial Reports\n",
     "\n",
     "In this notebook we use Milvus as the vector store, **mixtral_8x7b** as the LLM, and the **ai-embed-qa-4** embedding model, both provided through [NVIDIA AI Endpoints](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints), to build a simple RAG example for chatting with NVIDIA financial reports.\n",
     "\n",
     "NVIDIA financial reports are publicly available on nvidianews.nvidia.com.\n",
     "\n",
     "Below is an example: the financial report for the first quarter of fiscal year 2024.\n",
     "\n",
     "https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-first-quarter-fiscal-2024\n",
    "\n",
    "<img src=\"imgs/nvidianews.png\" alt=\"drawing\" width=\"800\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "612375a9",
   "metadata": {},
   "source": [
     "### Step 1 - Export the NVIDIA_API_KEY\n",
     "Supply your NVIDIA_API_KEY when you run the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d479e614",
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "if os.environ.get(\"NVIDIA_API_KEY\", \"\").startswith(\"nvapi-\"):\n",
    "    print(\"Valid NVIDIA_API_KEY already in environment. Delete to reset\")\n",
    "else:\n",
    "    nvapi_key = getpass.getpass(\"NVAPI Key (starts with nvapi-): \")\n",
    "    assert nvapi_key.startswith(\"nvapi-\"), f\"{nvapi_key[:5]}... is not a valid key\"\n",
    "    os.environ[\"NVIDIA_API_KEY\"] = nvapi_key"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b4afb52",
   "metadata": {},
   "source": [
     "### Step 2 - Initialize the LLM and Embedding Model\n",
     "Here we will use **mixtral_8x7b** as the LLM and **ai-embed-qa-4** as the embedding model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d3cb734",
   "metadata": {},
   "outputs": [],
   "source": [
     "# test run and see that you can generate a response successfully\n",
     "from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings\n",
     "\n",
     "# read the key from the environment so this works whether it was just entered\n",
     "# or was already set before the notebook started\n",
     "llm = ChatNVIDIA(model=\"ai-mixtral-8x7b-instruct\", nvidia_api_key=os.environ[\"NVIDIA_API_KEY\"], max_tokens=1024)\n",
     "\n",
     "# separate embedders: \"passage\" mode for documents, \"query\" mode for questions\n",
     "embedder_document = NVIDIAEmbeddings(model=\"ai-embed-qa-4\", model_type=\"passage\")\n",
     "embedder_query = NVIDIAEmbeddings(model=\"ai-embed-qa-4\", model_type=\"query\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2104106",
   "metadata": {},
   "source": [
     "### Step 3 - Ingest HTML files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4e45a22-c883-40cc-b2f4-e6f51866e52b",
   "metadata": {},
   "source": [
     "#### 3.1 Download HTML pages covering financial reports from fiscal years 2020 to 2024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31699728",
   "metadata": {},
   "outputs": [],
   "source": [
     "import requests\n",
     "\n",
     "urls_content = []\n",
     "\n",
     "# Q1-Q3 announcements use \"...-quarter-fiscal-{year}\"; Q4 announcements use \"...-quarter-and-fiscal-{year}\"\n",
     "url_template1 = \"https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-{quarter}-quarter-fiscal-{year}\"\n",
     "url_template2 = \"https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-{quarter}-quarter-and-fiscal-{year}\"\n",
     "\n",
     "for quarter in [\"first\", \"second\", \"third\", \"fourth\"]:\n",
     "    for year in range(2020, 2025):\n",
     "        args = {\"quarter\": quarter, \"year\": str(year)}\n",
     "        template = url_template2 if quarter == \"fourth\" else url_template1\n",
     "        urls_content.append(requests.get(template.format(**args)).content)"
   ]
  },
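  {
   "cell_type": "markdown",
   "id": "c3a91b20",
   "metadata": {},
   "source": [
    "A quick optional sanity check: `requests.get` does not raise on HTTP errors, so a missing report would still be appended as an error page. The cell below (a minimal sketch) only confirms the number of responses fetched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4b82c31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4 quarters x 5 fiscal years = 20 responses expected\n",
    "print(f\"fetched {len(urls_content)} pages\")"
   ]
  },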
  {
   "cell_type": "markdown",
   "id": "8cbada93-6b0c-49df-a6b9-781f1a2400fb",
   "metadata": {},
   "source": [
     "#### 3.2 Parse HTML files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67a1b247-8899-4a11-b055-7ac6d0495755",
   "metadata": {},
   "outputs": [],
   "source": [
     "# extract the url, title, text content, and tables from each html page\n",
     "from bs4 import BeautifulSoup\n",
     "import markdownify\n",
     "\n",
     "def extract_url_title_tables(soup):\n",
     "    url = \"\"\n",
     "    title = \"\"\n",
     "    tables = []\n",
     "    try:\n",
     "        if soup.find(\"title\"):\n",
     "            title = str(soup.find(\"title\").string)\n",
     "\n",
     "        og_url_meta = soup.find(\"meta\", property=\"og:url\")\n",
     "        if og_url_meta:\n",
     "            url = og_url_meta.get(\"content\", \"\")\n",
     "\n",
     "        # convert each table to markdown, then remove it from the soup so it\n",
     "        # does not pollute the plain-text content\n",
     "        for table in soup.find_all(\"table\"):\n",
     "            tables.append(markdownify.markdownify(str(table)))\n",
     "            table.decompose()\n",
     "\n",
     "        text_content = soup.get_text(separator=' ', strip=True)\n",
     "        text_content = ' '.join(text_content.split())\n",
     "\n",
     "        return url, title, text_content, tables\n",
     "    except Exception as e:\n",
     "        print(f\"parse error: {e}\")\n",
     "        return \"\", \"\", \"\", []\n",
     "\n",
     "parsed_htmls = []\n",
     "for url_content in urls_content:\n",
     "    soup = BeautifulSoup(url_content, 'html.parser')\n",
     "    url, title, content, tables = extract_url_title_tables(soup)\n",
     "    parsed_htmls.append({\"url\": url, \"title\": title, \"content\": content, \"tables\": tables})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d985444a-8ef2-47d7-ba13-1a4da369f468",
   "metadata": {},
   "source": [
    "#### 3.3 Summarize tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bb88c6c-8167-4b6e-9b83-5a6937f9f9fb",
   "metadata": {},
   "outputs": [],
   "source": [
     "# summarize each table with the LLM\n",
     "def get_table_summary(table, title, llm):\n",
     "    res = \"\"\n",
     "    try:\n",
     "        prompt = f\"\"\"\n",
     "                    [INST] You are a virtual assistant. Your task is to understand the content of TABLE in the markdown format.\n",
     "                    TABLE is from \"{title}\". Summarize the information in TABLE into SUMMARY. SUMMARY MUST be concise. Return SUMMARY only and nothing else.\n",
     "                    TABLE: ```{table}```\n",
     "                    Summary:\n",
     "                    [/INST]\n",
     "                \"\"\"\n",
     "        result = llm.invoke(prompt)\n",
     "        res = result.content\n",
     "    except Exception as e:\n",
     "        print(f\"Error: {e} while getting table summary from LLM\")\n",
     "        if not os.getenv(\"NVIDIA_API_KEY\"):\n",
     "            print(\"NVIDIA_API_KEY not set\")\n",
     "    return res\n",
     "\n",
     "\n",
     "for parsed_item in parsed_htmls:\n",
     "    title = parsed_item['title']\n",
     "    print(f\"summarizing tables in {title}...\")\n",
     "    for idx, table in enumerate(parsed_item['tables']):\n",
     "        parsed_item['tables'][idx] = get_table_summary(table, title, llm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccfd4607-3479-4cdd-b120-1f416959cb23",
   "metadata": {},
   "source": [
     "#### 3.4 Split the text and tables into chunks, embed each chunk, and store the embeddings in the Milvus vector database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3e3ae7e-ef19-41df-b566-1bcd7bc77978",
   "metadata": {},
   "outputs": [],
   "source": [
     "from langchain.docstore.document import Document\n",
     "from langchain.text_splitter import SentenceTransformersTokenTextSplitter\n",
     "\n",
     "TEXT_SPLITTER_MODEL = \"intfloat/e5-large-v2\"\n",
     "TEXT_SPLITTER_CHUNK_SIZE = 200\n",
     "TEXT_SPLITTER_CHUNK_OVERLAP = 50\n",
     "\n",
     "text_splitter = SentenceTransformersTokenTextSplitter(\n",
     "    model_name=TEXT_SPLITTER_MODEL,\n",
     "    tokens_per_chunk=TEXT_SPLITTER_CHUNK_SIZE,\n",
     "    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,\n",
     ")\n",
     "\n",
     "documents = []\n",
     "\n",
     "for parsed_item in parsed_htmls:\n",
     "    title = parsed_item['title']\n",
     "    url = parsed_item['url']\n",
     "    text_content = parsed_item['content']\n",
     "    documents.append(Document(page_content=text_content, metadata={'title': title, 'url': url}))\n",
     "\n",
     "    # table summaries become separate documents with the same metadata\n",
     "    for table in parsed_item['tables']:\n",
     "        documents.append(Document(page_content=table, metadata={'title': title, 'url': url}))\n",
     "\n",
     "documents = text_splitter.split_documents(documents)\n",
     "print(f\"obtained {len(documents)} chunks\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9319cfcb-ba59-4ffd-9df5-12ca11f9097c",
   "metadata": {},
   "outputs": [],
   "source": [
     "from langchain.vectorstores import Milvus\n",
     "\n",
     "COLLECTION_NAME = \"NVIDIA_Finance\"\n",
     "\n",
     "# connection_args assume a Milvus service reachable at host \"milvus\";\n",
     "# change the host to e.g. \"localhost\" if Milvus runs on your own machine\n",
     "vectorstore = Milvus(\n",
     "    embedding_function=embedder_document,\n",
     "    collection_name=COLLECTION_NAME,\n",
     "    connection_args={\"host\": \"milvus\", \"port\": \"19530\"},\n",
     "    drop_old=True,\n",
     "    auto_id=True,\n",
     ")\n",
     "vectorstore.add_documents(documents)\n",
     "docs = vectorstore.similarity_search(\"what are 2024 Q3 revenues?\")"
   ]
  },
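  {
   "cell_type": "markdown",
   "id": "e5c93d42",
   "metadata": {},
   "source": [
    "Optionally, inspect the chunks returned by the `similarity_search` call above before wiring up the full prompt. This sketch assumes the `docs` variable from the previous cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6da4e53",
   "metadata": {},
   "outputs": [],
   "source": [
    "# show the source title and the first 100 characters of each retrieved chunk\n",
    "for doc in docs:\n",
    "    print(doc.metadata.get(\"title\", \"\"), \"->\", doc.page_content[:100])"
   ]
  },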
  {
   "cell_type": "markdown",
   "id": "a03406c1",
   "metadata": {},
   "source": [
     "### Step 4 - Retrieve and Generate Answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d367584c",
   "metadata": {},
   "outputs": [],
   "source": [
     "from langchain.prompts.prompt import PromptTemplate\n",
    "\n",
    "PROMPT_TEMPLATE = \"\"\"[INST]You are a friendly virtual assistant and maintain a conversational, polite, patient, friendly and gender neutral tone throughout the conversation.\n",
    "\n",
    "Your task is to understand the QUESTION, read the Content list from the DOCUMENT delimited by ```, generate an answer based on the Content, and provide references used in answering the question in the format \"[Title](URL)\".\n",
    "Do not depend on outside knowledge or fabricate responses.\n",
    "DOCUMENT: ```{context}```\n",
    "\n",
    "Your response should follow these steps:\n",
    "\n",
     "1. The answer should be short, concise, and clear.\n",
    "    * If detailed instructions are required, present them in an ordered list or bullet points.\n",
    "2. If the answer to the question is not available in the provided DOCUMENT, ONLY respond that you couldn't find any information related to the QUESTION, and do not show references and citations.\n",
    "3. Citation\n",
     "    * ALWAYS start the citation section with \"Here are the sources used to generate this response:\" and follow with references in markdown link format [Title](URL) to support the answer.\n",
    "    * Use Bullets to display the reference [Title](URL).\n",
    "    * You MUST ONLY use the URL extracted from the DOCUMENT as the reference link. DO NOT fabricate or use any link outside the DOCUMENT as reference.\n",
    "    * Avoid over-citation. Only include references that were directly used in generating the response.\n",
    "    * If no reference URL can be provided, remove the entire citation section.\n",
    "    * The Citation section can include one or more references. DO NOT include same URL as multiple references. ALWAYS append the citation section at the end of your response.\n",
    "    * You MUST follow the below format as an example for this citation section:\n",
    "      Here are the sources used to generate this response:\n",
    "      * [Title](URL)\n",
    "[/INST] \n",
    "[INST]\n",
    "QUESTION: {question} \n",
    "FINAL ANSWER:[/INST]\"\"\"\n",
    "\n",
    "prompt_template = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=[\"context\", \"question\"])\n",
    "\n",
    "\n",
    "\n",
    "def build_context(chunks):\n",
    "    context = \"\"\n",
    "    for chunk in chunks:\n",
    "        context = context + \"\\n  Content: \" + chunk.page_content + \" | Title: (\" + chunk.metadata[\"title\"] + \") | URL: (\" + chunk.metadata.get(\"url\", \"source\") + \")\"\n",
    "    return context\n",
    "    \n",
    "\n",
    "def generate_answer(llm, vectorstore, prompt_template, question):\n",
    "    retrieved_chunks = vectorstore.similarity_search(question)    \n",
    "    context = build_context(retrieved_chunks)\n",
    "    args = {\"context\":context, \"question\":question}\n",
    "    prompt = prompt_template.format(**args)\n",
    "    ans = llm.invoke(prompt)\n",
    "    return ans.content\n",
     "\n",
    "question = \"what are 2024 Q1 revenues?\"\n",
    "\n",
    "generate_answer(llm, vectorstore, prompt_template, question)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
